Authors:
(1) Vishaal Udandarao, Tubingen AI Center, University of Tubingen, University of Cambridge, and equal contribution;
(2) Ameya Prabhu, Tubingen AI Center, University of Tubingen, University of Oxford, and equal contribution;
(3) Adhiraj Ghosh, Tubingen AI Center, University of Tubingen;
(4) Yash Sharma, Tubingen AI Center, University of Tubingen;
(5) Philip H.S. Torr, University of Oxford;
(6) Adel Bibi, University of Oxford;
(7) Samuel Albanie, University of Cambridge and equal advising, order decided by a coin flip;
(8) Matthias Bethge, Tubingen AI Center, University of Tubingen and equal advising, order decided by a coin flip.
Table of Links
Abstract and 1. Introduction
2 Concepts in Pre-training Data and Quantifying Frequency
3.2 Result: Pre-training Frequency is Predictive of "Zero-Shot" Performance
Testing Generalization to Purely Synthetic Concept and Data Distributions
5 Additional Insights from Pre-training Concept Frequencies
6 Testing the Tail: Let It Wag!
7 Conclusions and Open Problems, Acknowledgements, and References
Part I
Appendix
A. Concept Frequency is Predictive of Performance Across Prompting Strategies
B. Concept Frequency is Predictive of Performance Across Retrieval Metrics
C. Concept Frequency is Predictive of Performance for T2I Models
D. Concept Frequency is Predictive of Performance across Concepts only from Image and Text Domains
F. Why and How Do We Use RAM++?
G. Details about Misalignment Degree Results
I. Classification Results: Let It Wag!
Abstract
Web-crawled pre-training datasets underlie the impressive "zero-shot" evaluation performance of multimodal models, such as CLIP for classification/retrieval and Stable Diffusion for image generation. However, it is unclear how meaningful the notion of "zero-shot" generalization is for such multimodal models, as it is not known to what extent their pre-training datasets encompass the downstream concepts targeted during "zero-shot" evaluation. In this work, we ask: How is the performance of multimodal models on downstream concepts influenced by the frequency of these concepts in their pre-training datasets?

We comprehensively investigate this question across 34 models and five standard pre-training datasets (CC-3M, CC-12M, YFCC-15M, LAION-400M, LAION-Aesthetics), generating over 300GB of data artifacts. We consistently find that, far from exhibiting "zero-shot" generalization, multimodal models require exponentially more data to achieve linear improvements in downstream "zero-shot" performance, following a sample-inefficient log-linear scaling trend. This trend persists even when controlling for sample-level similarity between the pre-training and downstream datasets, and when testing on purely synthetic data distributions. Furthermore, upon benchmarking models on long-tailed data sampled on the basis of our analysis, we demonstrate that multimodal models across the board perform poorly. We contribute this long-tail test set as the Let It Wag! benchmark to further research in this direction. Taken together, our study reveals an exponential need for training data, which implies that the key to "zero-shot" generalization capabilities under large-scale training paradigms remains to be found.
1 Introduction
Multimodal models like CLIP [91] and Stable Diffusion [96] have shown remarkable performance on downstream tasks: CLIP is now the de facto standard for "zero-shot" image classification [133, 72, 126, 48, 132] and image-text retrieval [46, 64, 24, 117, 129], while Stable Diffusion is the de facto standard for "zero-shot" text-to-image (T2I) generation [93, 17, 96, 41]. In this work, we investigate this empirical success through the lens of zero-shot generalization [69], which refers to a model's ability to apply its learned knowledge to new, unseen concepts. Accordingly, we ask: Are current multimodal models truly capable of "zero-shot" generalization?
To investigate this, we conduct a comparative analysis along two main axes: (1) the performance of models across a range of downstream tasks, and (2) the frequency of the test concepts in their pre-training datasets. We compile a comprehensive list of 4,029 concepts[1] from 27 downstream tasks spanning classification, retrieval, and image generation, and assess model performance on these concepts. Our analysis covers large-scale pre-training datasets of different scales, data curation methods, and sources (CC-3M [107], CC-12M [27], YFCC-15M [113], LAION-Aesthetics, LAION-400M). We consistently find that the frequency of a concept in the pre-training data is a strong predictor of model performance on that concept: model performance scales linearly as the concept frequency in the pre-training data increases exponentially, i.e., we observe a consistent log-linear scaling trend. This log-linear trend is robust to controlling for correlated factors (similar samples in pre-training and test data [79]) and holds across different concept distributions as well as samples generated entirely synthetically [51].
Model performance scales linearly as concept frequency in pre-training data increases exponentially.
Our findings indicate that the impressive empirical performance of multimodal models like CLIP and Stable Diffusion can be attributed in large part to the presence of the test concepts in their vast pre-training datasets; hence, their reported performance does not constitute "zero-shot" generalization. Quite the contrary, these models require exponentially more data on a concept to linearly improve their performance on tasks pertaining to that concept, highlighting extreme sample inefficiency.
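As a toy illustration of the log-linear scaling trend described above, one can fit downstream accuracy as a linear function of log concept frequency. The numbers below are made up for the sketch and are not taken from the paper's results:

```python
import numpy as np

# Hypothetical per-concept pre-training frequencies and corresponding
# downstream accuracies (illustrative values only).
freqs = np.array([10, 100, 1_000, 10_000, 100_000], dtype=float)
accs = np.array([0.22, 0.31, 0.40, 0.49, 0.58])

# Log-linear fit: acc ~ slope * log10(freq) + intercept.
slope, intercept = np.polyfit(np.log10(freqs), accs, deg=1)

# Under a log-linear trend, every 10x increase in concept frequency
# yields only a constant additive gain in accuracy (the slope),
# which is what makes the scaling sample-inefficient.
print(f"gain per 10x more data: {slope:.3f}")
```

With these toy values, each tenfold increase in data buys a fixed 0.09 accuracy gain, so reaching a target accuracy requires exponentially more samples.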
In our analysis, we additionally document the distribution of concepts in the pre-training data and show that:
• Concept Distribution: Across all pre-training datasets, the distribution of concepts is long-tailed (see Fig. 5 in Section 5), indicating that a large fraction of concepts is rare. However, given the extreme sample inefficiency observed, what is rare is not learned well during multimodal pre-training.
• Concept Correlation across Pre-training Datasets: The distributions of concepts across different pre-training datasets are strongly correlated (see Tab. 4 in Section 5), suggesting that web crawls yield surprisingly similar concept distributions across different pre-training data curation strategies, which limits the expected benefits of rebalancing [11, 125].
• Image-Text Misalignment between Concepts in Pre-training Data: Concepts frequently appear in one modality but not the other, implying significant misalignment (see Tab. 3 in Section 5). Our released data artifacts can aid image-text alignment efforts at scale by precisely indicating the samples in which the modalities misalign. Note that the log-linear trend holds across both modalities despite this misalignment.
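The cross-dataset concept correlation above can be quantified with a rank correlation between per-concept counts. A minimal sketch, using made-up counts for two hypothetical web-crawled corpora (not the paper's data):

```python
import numpy as np

def rankdata(x):
    # Simple rank transform (no tie handling needed for this toy example).
    ranks = np.empty_like(np.argsort(x))
    ranks[np.argsort(x)] = np.arange(len(x))
    return ranks

# Hypothetical per-concept counts in two different web-crawled corpora,
# both long-tailed: a few head concepts dominate, most are rare.
counts_a = np.array([50_000, 12_000, 900, 850, 40, 8, 3])
counts_b = np.array([61_000, 9_500, 700, 1_200, 55, 10, 2])

# Spearman rank correlation: a value near 1 means the two corpora rank
# concepts from head to tail in nearly the same order.
rho = np.corrcoef(rankdata(counts_a), rankdata(counts_b))[0, 1]
print(f"spearman rho: {rho:.2f}")
```

A high rho across independently curated crawls is what makes the similarity of their concept distributions notable.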
To provide a simple benchmark of generalization performance for multimodal models that controls for concept frequency in the training set, we introduce a new long-tail test dataset called "Let It Wag!". Current models across the board show significant performance drops on this benchmark.
Prior work [91, 46, 82, 42, 83, 74] has studied the effect of pre-training data on downstream performance. Mayilvahanan et al. [79] showed that CLIP's performance correlates with the similarity between training and test datasets. In studies on specific domains such as question answering [62] and numerical reasoning [94] in large language models, high train-test set similarity did not fully account for the observed performance levels [127]. Our comprehensive analysis of several image-text datasets significantly adds to this line of work, by showing (1) that concept frequency
This paper is available on arxiv under CC BY 4.0 DEED license.
[1] Class categories for classification tasks, objects in the text captions for retrieval tasks, and objects in the text prompts for generation tasks; see Section 2 for more details on how concepts are defined.